ABSTRACT

We describe in this paper a code-excited linear predictive coder
in which the optimum innovation sequence is selected from a code
book of stored sequences to optimize a given fidelity criterion. Each
sample of the innovation sequence is filtered sequentially through two
time-varying linear recursive filters, one with a long-delay (related to
pitch period) predictor in the feedback loop and the other with a
short-delay predictor (related to spectral envelope) in the feedback
loop. We code speech, sampled at 8 kHz, in blocks of 5-msec duration.
Each block, consisting of 40 samples, is produced from one of
1024 possible innovation sequences. The bit rate for the innovation
sequence is thus 1/4 bit per sample. We compare in this paper
several different random and deterministic code books for their
effectiveness in providing the optimum innovation sequence in each
block. Our results indicate that a random code book has a slight
speech quality advantage at low bit rates. Examples of speech
produced by the above method will be played at the conference.



INTRODUCTION

The performance of adaptive predictive coders for speech signals
using instantaneous quantizers deteriorates rapidly at bit rates below
about 10 kbits/sec. Our past work has shown that high speech quality
can be maintained in predictive coders at lower bit rates by using
non-instantaneous stochastic quantizers which minimize a subjective
error criterion based on properties of human auditory perception [1].
We have used tree search procedures to encode the innovation signal
and have found the tree codes to perform very well at 1 bit/sample (8
kbits/sec). The speech quality is maintained even at 1/2 bit/sample
when the tree has 4 branches at every node and 4 white Gaussian
random numbers on each branch [2].

The tree search procedures are suboptimal and the performance
of tree codes deteriorates significantly when the innovation signal is
coded at only 1/4 bit/sample (2 kbits/sec). Such low bit rates for the
innovation signal are necessary to bring the total bit rate for coding
the speech signal down to 4.8 kbits/sec, a rate that offers the
possibility of carrying digital speech over a single analog voice channel.

Fehn and Noll [3] have discussed the merits of various multipath
search coding procedures: code-book coding, tree coding, and trellis
coding. Code-book coding is of particular interest at very low bit
rates. In code-book coding, the set of possible sequences for a block
of innovation signal is stored in a code book. For a given speech
segment, the optimum innovation sequence is selected to optimize a given
fidelity criterion by exhaustive search of the code book, and an index
specifying the optimum sequence is transmitted to the receiver. In
general, code-book coding is impractical due to the large size of the
code books. However, at the very low bit rates we are aiming for,
exhaustive search of the code book to find the best innovation
sequence for encoding short segments of the speech signal becomes
possible [4].
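
The arithmetic behind this feasibility argument can be sketched in a few
lines. The sketch assumes the 40-sample block and 8 kHz sampling rate quoted
in the abstract; the function name code_book_size is illustrative and does
not come from the paper.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 8000   # sampling rate quoted in the paper
BLOCK_LEN = 40          # samples in one 5-msec block at 8 kHz


def code_book_size(bits_per_sample, block_len):
    """Number of code words needed to code one block at the given rate."""
    bits_per_block = Fraction(bits_per_sample) * block_len
    if bits_per_block.denominator != 1:
        raise ValueError("rate times block length must be a whole number of bits")
    return 2 ** int(bits_per_block)


# Compare the rate targeted here with the higher rates used for tree coding.
for rate in (Fraction(1, 4), Fraction(1, 2), Fraction(1)):
    size = code_book_size(rate, BLOCK_LEN)
    kbits_per_sec = float(rate * SAMPLE_RATE_HZ) / 1000
    print(f"{rate} bit/sample -> {kbits_per_sec} kbits/sec, {size} code words")
```

At 1/4 bit/sample the block needs 10 bits and hence 1024 code words, which
can be searched exhaustively; at 1 bit/sample the same block would need
2^40 code words.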



SPEECH SYNTHESIS MODEL 

The speech synthesizer in a code-excited linear predictive coder
is identical to the one used in adaptive predictive coders [1]. It
consists of two time-varying linear recursive filters, each with a
predictor in its feedback loop, as shown in Fig. 1. The first feedback loop
includes a long-delay (pitch) predictor which generates the pitch
periodicity of voiced speech. The second feedback loop includes a
short-delay predictor to restore the spectral envelope.

Fig. 1. Speech synthesis model with short and long delay
predictors.
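
A minimal sketch of the two-loop synthesizer of Fig. 1 follows. The function
name, the argument layout, and the convention that the three pitch taps sit
at delays M-1, M, and M+1 around the pitch lag M are assumptions made for
illustration; the text specifies only the number of coefficients. The filter
coefficients are held fixed over the block and the memories default to zero.

```python
def synthesize(innovation, pitch_lag, pitch_coeffs, lpc_coeffs,
               pitch_memory=None, lpc_memory=None):
    """Pass an innovation block through the long-delay and short-delay loops.

    pitch_coeffs: 3 taps applied at delays pitch_lag-1, pitch_lag, pitch_lag+1
                  (assumed layout; pitch_lag must be at least 2).
    lpc_coeffs:   a_1 .. a_p of the short-delay predictor.
    Memories hold past filter outputs, most recent sample last. If given,
    they need at least pitch_lag+1 and p samples respectively.
    """
    p = len(lpc_coeffs)
    excitation = list(pitch_memory or [0.0] * (pitch_lag + 1))
    speech = list(lpc_memory or [0.0] * p)
    n_past_speech = len(speech)
    for x in innovation:
        # Long-delay loop: add the prediction from about one pitch period ago.
        d = x
        for j, b in zip((-1, 0, 1), pitch_coeffs):
            d += b * excitation[len(excitation) - pitch_lag - j]
        excitation.append(d)
        # Short-delay loop: add the prediction from the last p output samples.
        s = d + sum(a * speech[-k] for k, a in enumerate(lpc_coeffs, start=1))
        speech.append(s)
    return speech[n_past_speech:]


# A unit impulse through a single-tap pitch loop repeats every 4 samples.
print(synthesize([1.0] + [0.0] * 8, 4, (0.0, 0.5, 0.0), []))
```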



The two predictors are determined using procedures outlined in
References 1 and 5. The short-delay predictor has 16 coefficients, and
these are determined using the weighted stabilized covariance method
of LPC analysis [1,5] once every 10 msec. In this method of LPC
analysis, the instantaneous prediction error is weighted by a Hamming
window 20 msec in duration and the predictor coefficients are
determined by minimizing the energy of the weighted error. The
long-delay (pitch) predictor has 3 coefficients which are determined by
minimizing the mean-squared prediction error after pitch prediction
over a time interval of 5 msec [2].

SELECTION OF OPTIMUM INNOVATION SEQUENCE 

Let us consider the coding of a short block of speech signal 5
msec in duration. Each such block consists of 40 speech samples at a
sampling frequency of 8 kHz. A bit rate of 1/4 bit per sample
corresponds to 1024 possible sequences (10 bits) of length 40 for each
block. The procedure for selecting the optimum sequence is
illustrated in Fig. 2. Each member of the code book provides 40 samples
of the innovation signal. Each sample of the innovation signal is
scaled by an amplitude factor that is constant for the 5-msec block
and is reset to a new value once every 5 msec. The scaled samples
are filtered sequentially through two recursive filters, one for
introducing the voice periodicity and the other for the spectral
envelope. The regenerated speech samples at the output of the second
filter are compared with the corresponding samples of the original
speech signal to form a difference signal. The difference signal
representing the objective error is further processed through a linear
filter to attenuate those frequencies where the error is perceptually
less important and to amplify those frequencies where the error is
perceptually more important. The transfer function of the weighting
filter is given by

$$ W(z) = \frac{1 - \sum_{k=1}^{p} a_k z^{-k}}{1 - \sum_{k=1}^{p} a_k \alpha^k z^{-k}} \tag{1} $$



where a_k are the short-delay predictor coefficients, p = 16, and α is a
parameter for controlling the weighting of the error as a function of
frequency. A suitable value of α is given by



$$ \alpha = e^{-2\pi \cdot 100 / f_s} \tag{2} $$



where f_s is the sampling frequency. The weighted mean-squared
error is determined by squaring and averaging the error samples at
the output of the weighting filter for each 5-msec block. The
optimum innovation sequence for each block is selected by exhaustive
search to minimize the weighted error. As mentioned earlier, prior to
filtering, each sample of the innovation sequence is scaled by an
amplitude factor that is constant for the 5-msec block. This
amplitude factor is determined for each code word by minimizing the
weighted mean-squared error for the block.
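
Because the synthesis and weighting filters are linear, scaling a code word
before filtering is equivalent to scaling its filtered version, so the
amplitude factor has a closed least-squares form. The derivation below is a
standard one and is not spelled out in the paper; the symbols t_n (the
weighted target for the block, with the contribution of the filter memories
already removed), y_n (the unscaled code word after both synthesis filters
and the weighting filter), and g (the amplitude factor) are introduced here
for illustration only.

```latex
% Weighted squared error for one code word over a 40-sample block
E(g) = \sum_{n=0}^{39} \left( t_n - g\, y_n \right)^2

% Setting the derivative with respect to g to zero
\frac{dE}{dg} = -2 \sum_{n} y_n \left( t_n - g\, y_n \right) = 0
\quad\Longrightarrow\quad
g^{*} = \frac{\sum_{n} t_n y_n}{\sum_{n} y_n^2}

% Error remaining at the optimum amplitude factor
E(g^{*}) = \sum_{n} t_n^2 - \frac{\left( \sum_{n} t_n y_n \right)^2}{\sum_{n} y_n^2}
```

Under these assumptions, minimizing the weighted error over the code book
amounts to maximizing the second term of the last expression. Dividing by
the block length to obtain a mean does not change which code word wins.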

Fig. 2. Block diagram illustrating the procedure for selecting
the optimum innovation sequence.
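
The selection procedure of Fig. 2 can be sketched as follows. This is a
simplified illustration, not the authors' program: the weighting filter is
assumed to have the form W(z) = A(z)/A(z/alpha) with alpha left as a
parameter, all filters start from zero state, and the amplitude factor is
taken as the least-squares gain for each code word. The names
perceptual_weighting and search_code_book are hypothetical.

```python
def perceptual_weighting(error, lpc_coeffs, alpha):
    """Filter a block through W(z) = A(z) / A(z/alpha) from zero state."""
    out = []
    for n in range(len(error)):
        w = error[n]
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                w -= a * error[n - k]              # numerator: 1 - sum a_k z^-k
                w += a * alpha ** k * out[n - k]   # denominator: 1 - sum a_k alpha^k z^-k
        out.append(w)
    return out


def search_code_book(target, filtered_code_words):
    """Exhaustive search; returns (index, gain, squared error) of the best word.

    target:              weighted target samples for the block.
    filtered_code_words: each code word after synthesis and weighting filters.
    """
    best = None
    for index, y in enumerate(filtered_code_words):
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero response cannot be scaled to fit anything
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


# Toy example with three 3-sample "code words".
print(search_code_book([2.0, 4.0, 6.0],
                       [[1.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]))
```

In a complete coder the target would be the weighted original speech minus
the decaying response of the filter memories, and each entry of
filtered_code_words would be a code word passed through the two synthesis
filters and the weighting filter; both steps are omitted here for brevity.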



CONSTRUCTION OF OPTIMUM CODE BOOKS

A code book, within the limitation of its size, should provide as
dense a sampling as possible of the space of innovation sequences. In
principle, the code words could be block codes that are optimally
placed on a hypersphere in the 40-dimensional space (representing 40
samples in each 5-msec block). Fehn and Noll [3] have argued that
random code books (code books with randomly selected code words)
are less restrictive than deterministic code books. Random code books,
in some sense, provide a lower bound for the performance at any given
bit rate. A deterministic code book, if properly constructed, should
provide a performance that is at least equal to, if not better than,
that of the random code books, and the deterministic nature of the
code book should make it easier to find the optimum innovation
sequence for each block of speech. However, it is generally very
difficult to design an optimum deterministic code book.

As a start, we have chosen a random code book in which each
possible code word is constructed of white Gaussian random numbers
with unit variance. We have chosen the Gaussian distribution since
our earlier work has shown that the probability density function of the
prediction error samples (after both short-delay and long-delay
predictions) is nearly Gaussian [1]. Figure 3 shows a plot of the first-order
cumulative amplitude distribution function for the prediction residual
samples and compares it with the corresponding Gaussian distribution
function with the same mean and variance. A closer examination of
the prediction error shows that the Gaussian assumption is valid
almost everywhere except for stop bursts of unvoiced stop consonants
and for a few pitch periods during the transition from unvoiced or
silence regions to voiced speech.



Fig. 3. First-order cumulative probability distribution function
for the prediction residual samples (solid curve). The
corresponding Gaussian distribution function with the same
mean and variance is shown by the dashed curve.
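
A white Gaussian code book of this kind, together with the type of
comparison plotted in Fig. 3, can be sketched as follows. The seed argument
and the function names are illustrative assumptions; the paper does not say
how its random numbers were generated. Here the empirical distribution is
computed from the code book samples themselves rather than from a speech
prediction residual.

```python
import math
import random


def gaussian_code_book(size=1024, block_len=40, seed=0):
    """Code book of `size` words, each of `block_len` unit-variance Gaussian samples."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(block_len)] for _ in range(size)]


def gaussian_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a Gaussian random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def empirical_cdf(samples, x):
    """Fraction of samples not exceeding x (first-order cumulative distribution)."""
    return sum(1 for s in samples if s <= x) / len(samples)


code_book = gaussian_code_book()
samples = [v for word in code_book for v in word]
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(empirical_cdf(samples, x), 3), round(gaussian_cdf(x), 3))
```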

Each sample v_n of the innovation sequence in a Gaussian code
book can be expressed as a Fourier series of N cosine functions
(N = 20):

$$ v_n = \sum_{k=0}^{N-1} c_k \cos(\pi k n / N + \phi_k), \quad n = 0, 1, \ldots, 2N-1, \tag{3} $$

where c_k and φ_k are independent random variables, φ_k is uniformly
distributed between 0 and 2π, and c_k is Rayleigh distributed with
probability density function

$$ p(c_k) = c_k \exp(-c_k^2 / 2), \quad c_k \ge 0. \tag{4} $$

The function of the innovation sequence in the synthesis model of
Fig. 1 is to provide a correction to the filter output in reproducing the
speech waveform within the limitation of the size of the code book.
Using the Fourier series model of Eq. (3), the correction can be
considered separately for the amplitude and phase of each Fourier
component. Do we need both amplitude and phase corrections for
high-quality speech synthesis? Are the two types of corrections equally
important? These questions can be answered by restricting the
variations in the amplitudes and phases of various Fourier components in
Eq. (3). For example, a code book can be formed by setting the
amplitudes c_k to a constant value and by keeping the phases φ_k
uniformly distributed between 0 and 2π. Another code book is formed
by setting the phases to some constant set of values and by keeping
the amplitudes Rayleigh distributed in accordance with Eq. (4).
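
The three code book variants described above can be sketched with a single
generator. This is an illustration under stated assumptions: the sum is
taken over k = 0, ..., N-1 with N = 20, the Rayleigh amplitudes have unit
scale parameter, and the constant values chosen in the usage lines (unit
amplitude, zero phases) are placeholders, since the text does not give the
constants actually used. The function name is hypothetical.

```python
import math
import random


def fourier_code_word(rng, n_terms=20, fixed_amplitude=None, fixed_phases=None):
    """One code word of length 2*n_terms built from n_terms cosine components.

    fixed_amplitude: if given, every amplitude c_k is set to this constant.
    fixed_phases:    if given, a list of n_terms phases shared by all words.
    """
    amps, phases = [], []
    for k in range(n_terms):
        if fixed_amplitude is None:
            # Rayleigh draw by inverting the distribution 1 - exp(-c^2 / 2).
            amps.append(math.sqrt(-2.0 * math.log(1.0 - rng.random())))
        else:
            amps.append(fixed_amplitude)
        if fixed_phases is None:
            phases.append(rng.uniform(0.0, 2.0 * math.pi))
        else:
            phases.append(fixed_phases[k])
    return [sum(c * math.cos(math.pi * k * n / n_terms + ph)
                for k, (c, ph) in enumerate(zip(amps, phases)))
            for n in range(2 * n_terms)]


rng = random.Random(1)
SIZE = 64  # small demo size; the paper uses 1024 code words
gaussian_like = [fourier_code_word(rng) for _ in range(SIZE)]
constant_amplitude = [fourier_code_word(rng, fixed_amplitude=1.0) for _ in range(SIZE)]
constant_phase = [fourier_code_word(rng, fixed_phases=[0.0] * 20) for _ in range(SIZE)]
```

With Rayleigh amplitudes and uniform phases each cosine term is a
unit-variance Gaussian variable, so the samples produced this way have
variance N; dividing by the square root of N would bring them to unit
variance.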

We have also used a code book in which the different innovation
sequences are obtained directly from the prediction error (after
normalizing to unit variance) of speech signals. The amplitudes and
phases are no longer distributed according to Rayleigh and uniform
density functions, respectively, but reflect the distributions represented
in the actual prediction error.
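
One way such a code book could be assembled is sketched below. The text does
not say how blocks were chosen or whether the normalization was applied per
block or over the whole residual; this sketch assumes non-overlapping
blocks, each scaled to unit mean-square value (equal to unit variance for a
zero-mean block). The function name is hypothetical.

```python
import math


def residual_code_book(residual, block_len=40, size=1024):
    """Cut a prediction residual into blocks and scale each to unit mean square."""
    words = []
    for start in range(0, len(residual) - block_len + 1, block_len):
        block = residual[start:start + block_len]
        rms = math.sqrt(sum(v * v for v in block) / block_len)
        if rms > 0.0:                      # skip silent blocks, which cannot be scaled
            words.append([v / rms for v in block])
        if len(words) == size:
            break
    return words


# Toy residual: a constant block, a silent block, and an alternating block.
toy = [3.0] * 40 + [0.0] * 40 + [1.0, -1.0] * 20
print(len(residual_code_book(toy)))
```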

RESULTS 

As we mentioned earlier, the random code book provides a base
line against which we can compare other code books. We have
synthesized several speech utterances spoken by both male and female
speakers (pitch frequencies ranging from 80 Hz to 400 Hz) using the
different code books discussed in the previous section. The random
code book (with 1024 code words) provided unexpectedly good
performance. Even in close pair-wise comparisons over head phones, only
occasional small differences were noticeable between the original and
synthetic speech utterances. These results suggest that a 10-bit
random code book has sufficient flexibility to produce high-quality speech
from the synthesis model shown in Fig. 1.



The waveforms of the original and synthetic speech signals were
found to match closely for voiced speech and reasonably well for
unvoiced speech. The signal-to-noise ratio averaged over several
seconds of speech was found to be approximately 15 dB. Examples of
speech waveforms are shown in Fig. 4. The figure shows (a) original
speech, (b) synthetic speech, (c) the LPC prediction residual, (d) the
reconstructed LPC residual, (e) the prediction residual after pitch
prediction, and (f) the coded residual from a 10-bit random code
book. As expected, the Gaussian code book is not able to reproduce a
sharp impulse in the coded residual waveform. The absence of the
sharp impulse produces appreciable phase distortion in the
reconstructed LPC prediction residual. However, this phase distortion is
mostly limited to frequency regions outside the formants.

Fig. 4. Waveforms of different signals in the coder: (a) the
original speech, (b) the synthetic speech, (c) the LPC
prediction residual, (d) the reconstructed LPC residual, (e)
the prediction residual after pitch prediction, and (f) the coded
residual from a 10-bit random code book. Waveforms (c) and
(d) are amplified 5 times relative to the speech signal.
Waveforms (e) and (f) are amplified by an additional factor
of 2.
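
A signal-to-noise ratio of the kind quoted above can be computed as in the
following sketch. It assumes the conventional definition, the ratio of
signal energy to the energy of the difference between original and
synthetic waveforms, accumulated over the whole utterance; the text does not
state whether this or a segment-by-segment average was used. The function
name is hypothetical.

```python
import math


def snr_db(original, synthetic):
    """Signal-to-noise ratio in dB between an original and a synthetic waveform."""
    signal = sum(s * s for s in original)
    noise = sum((s - y) ** 2 for s, y in zip(original, synthetic))
    if noise == 0.0:
        return math.inf  # identical waveforms
    return 10.0 * math.log10(signal / noise)


# An error of one tenth of the signal amplitude gives about 20 dB.
print(snr_db([1.0, 1.0, 1.0, 1.0], [0.9, 0.9, 0.9, 0.9]))
```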



We have also examined the distribution of the reconstruction
error amongst various code words. Figure 5 shows a plot of the
number of code words which produced a given amount of rms error in
a particular 5-msec block of speech. The behavior shown is typical
of what we observed in several blocks. The minimum rms error for
this block was 30 and only 5 code words (out of a total of 1024)
produced an rms error less than 33. This indicates that the size of the
code book cannot be reduced significantly without producing a
substantial increase in the error.




Fig. 5. Distribution of error amongst the various code words in
a Gaussian code book.
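
A distribution such as the one in Fig. 5 can be tabulated as sketched below.
The sketch assumes that each code word has already been passed through the
synthesis and weighting filters and that the rms error is measured after the
least-squares amplitude factor is applied; the bin width and the function
name are illustrative choices.

```python
import math


def rms_error_histogram(target, filtered_code_words, bin_width=1.0):
    """Count code words by the rms error left after optimal scaling."""
    counts = {}
    for y in filtered_code_words:
        energy = sum(v * v for v in y)
        corr = sum(t * v for t, v in zip(target, y))
        gain = corr / energy if energy > 0.0 else 0.0
        rms = math.sqrt(sum((t - gain * v) ** 2 for t, v in zip(target, y))
                        / len(target))
        bin_start = math.floor(rms / bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))


# One code word matches the target exactly, the other is orthogonal to it.
print(rms_error_histogram([3.0, 4.0], [[3.0, 4.0], [4.0, -3.0]]))
```

Counting how many code words fall within a few units of the minimum, as the
text does for the block of Fig. 5, shows how sharply the best candidates
stand out from the rest of the code book.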



Due to the random nature of code books, different Gaussian code
books produced different innovation sequences. However, we did not
hear any audible difference between the speech signals reconstructed
from these different code books. Figure 6 shows several examples of
the innovation sequences selected from several different Gaussian code
books for one 5-msec block. The innovation sequences for other
previous blocks were kept the same; thus, the filter coefficients and the
filter memories were identical at the beginning of the block. The
coded innovation sequences show very little similarity to each other.
The amplitude spectrum for the different sequences is shown in Fig.
6(b). Again, there is no obvious common pattern amongst the
different amplitude spectra. The corresponding phase responses are
shown in Fig. 6(c).

Fig. 6. (a) Waveforms of different innovation sequences for a
particular 5-msec block; (b) amplitude spectra of innovation
sequences; and (c) phase responses of innovation sequences.
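
Amplitude spectra and phase responses like those of Fig. 6(b) and 6(c) can
be obtained from a 40-sample innovation sequence with a discrete Fourier
transform, as in the sketch below. How the spectra in the figure were
actually computed is not stated in the text, so the direct transform and the
function name used here are assumptions.

```python
import cmath


def spectrum(sequence):
    """Amplitude and phase of the DFT bins from 0 up to half the sampling rate."""
    n = len(sequence)
    amplitudes, phases = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(sequence))
        amplitudes.append(abs(coeff))
        phases.append(cmath.phase(coeff))
    return amplitudes, phases


# A unit impulse has a flat amplitude spectrum and zero phase.
amps, phs = spectrum([1.0] + [0.0] * 39)
print(len(amps), max(amps), min(amps))
```

For a 40-sample block at 8 kHz the bins are spaced 200 Hz apart.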

The code book with constant amplitude but uniformly distributed
phases performed nearly as well as the Gaussian code book. The
signal-to-noise ratio decreased by about 1.5 dB and there was an
audible difference between the two code books. The code book with
constant phases but Rayleigh-distributed amplitudes performed very
poorly, both in the signal-to-noise ratio and in listening to synthetic
speech. The code book based on the prediction residual signals
derived from speech performed as well as the Gaussian code book.

CONCLUDING REMARKS 

Our present work with the code-excited linear predictive coder
has demonstrated that such coders offer considerable promise for
producing high quality synthetic speech at bit rates as low as 4.8
kbits/sec. The random code book we have used so far obviously does
not provide the best choice. The proper design of the code book is the
key to success for achieving even lower bit rates than we realized in
this study. We have so far employed a fixed code book for all speech
data. A fixed code book is somewhat wasteful. Further efficiency
could be gained by making the code book adaptive to the time-varying
linear filters used to synthesize speech and to weight the error. The
coding procedure is computationally very expensive; it took 125 sec of
Cray-1 CPU time to process 1 sec of the speech signal. The program
was, however, not optimized to run on the Cray. Most of the time was
taken up by the search for the optimum innovation sequence. A code
book with sufficient structure amenable to fast search algorithms
could lead to real time implementation of code-excited coders.
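
A rough operation count illustrates why the search dominates the
computation. The figures for code book size, block length, and predictor
orders come from the text; the per-sample tap count is a naive assumption
(every code word filtered sample by sample through the 3-tap pitch loop, the
16-tap short-delay loop, and a weighting filter with 16 numerator and 16
denominator taps), and the count ignores the error accumulation and any
shortcuts an optimized program might take.

```python
CODE_WORDS = 1024                     # 10-bit code book
BLOCK_LEN = 40                        # samples per 5-msec block
BLOCKS_PER_SEC = 8000 // BLOCK_LEN    # 200 blocks per second of speech

# Assumed taps per sample: pitch loop + short-delay loop + weighting filter.
TAPS_PER_SAMPLE = 3 + 16 + (16 + 16)


def multiply_adds_per_second(taps_per_sample=TAPS_PER_SAMPLE):
    """Naive multiply-add count for exhaustive search over one second of speech."""
    return CODE_WORDS * BLOCK_LEN * BLOCKS_PER_SEC * taps_per_sample


print(multiply_adds_per_second())
```

Under these assumptions the search alone requires on the order of 4 x 10^8
multiply-adds per second of speech.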